[tx] Implement cutlass kernel for ragged_dot with group_offset #896

Open
pcmoritz wants to merge 116 commits into NovaSky-AI:main from pcmoritz:tx-ragged-dot-cutlass

@pcmoritz
Collaborator

@pcmoritz pcmoritz commented Jan 19, 2026

This brings down the step time of

uv run --with wandb --with tinker==0.3.0 sl_loop.py     base_url=http://localhost:8000     model_name=Qwen/Qwen3-30B-A3B lora_rank=1 max_length=512

with

uv run --extra gpu --extra tinker -m tx.tinker.api     --base-model Qwen/Qwen3-30B-A3B     --backend-config '{"max_lora_adapters": 2, "max_lora_rank": 8, "expert_parallel_size": 8, "train_micro_batch_size": 1, "shard_attention_heads": false}'

from 40s to 20s. I spent some time tuning the tile sizes and also tried different tile sizes / configurations for different settings (e.g. the different projections or the low-k setting for LoRA), but that only made a very small difference and isn't worth the complexity for now.
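
For readers unfamiliar with the operation, here is a minimal pure-Python reference sketch of what a ragged dot with a group offset computes. It is not the CUTLASS kernel from this PR. The semantics are an assumption: rows of `lhs` are sorted by group, `rhs` holds only the local shard of per-group weight matrices starting at group `group_offset`, and rows belonging to groups outside the shard produce zeros. The name `ragged_dot_reference` is hypothetical.

```python
def ragged_dot_reference(lhs, rhs, group_sizes, group_offset=0):
    """Reference ragged dot (assumed semantics, for illustration only).

    lhs:          m rows, each a list of k floats, sorted by group
    rhs:          list of local weight matrices, each k x n
    group_sizes:  number of lhs rows per global group
    group_offset: global index of the first group stored in rhs
    """
    n = len(rhs[0][0])
    out = [[0.0] * n for _ in range(len(lhs))]
    row = 0
    for group, size in enumerate(group_sizes):
        local = group - group_offset
        if 0 <= local < len(rhs):  # group is owned by this shard
            weight = rhs[local]
            for r in range(row, row + size):
                out[r] = [
                    sum(lhs[r][i] * weight[i][j] for i in range(len(weight)))
                    for j in range(n)
                ]
        row += size  # rows of non-local groups stay zero
    return out


lhs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
local_rhs = [[[2.0, 0.0], [0.0, 2.0]]]  # weights for global group 1 only
result = ragged_dot_reference(lhs, local_rhs, [1, 2], group_offset=1)
```

Under these assumptions, the first row (group 0, not in the shard) stays zero, while the remaining two rows are multiplied by the single local weight matrix.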


Contributor

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 8 additional findings.
